EnzyACT: A Novel Deep Learning Method to Predict the Impacts of Single and Multiple Mutations on Enzyme Activity

Enzyme engineering involves the customization of enzymes by introducing mutations to expand the application scope of natural enzymes. One limitation of that is the complex interaction between two key properties, activity and stability, where the enhancement of one often leads to the reduction of the other, also called the trade-off mechanism. Although dozens of methods that predict the change of protein stability upon mutations have been developed, the prediction of the effect on activity is still in its early stage. Therefore, developing a fast and accurate method to predict the impact of the mutations on enzyme activity is helpful for enzyme design and understanding of the trade-off mechanism. Here, we introduce a novel approach, EnzyACT, a deep learning method that fuses graph technique and protein embedding to predict activity changes upon single or multiple mutations. Our model combines graph-based techniques and language models to predict the activity changes. Moreover, EnzyACT is trained on a new curated data set including both single- and multiple-point mutations. When benchmarked on multiple independent data sets, it shows uniform performance on problems affected by mutations. This work also provides insights into the impact of distant mutations within activity design, which could also be useful for predicting catalytic residues and developing improved enzyme-engineering strategies.


Graph Convolution Network Architecture
In the protein, the nodes represent the amino acids of which proteins are made up of, and the interactions between residues make up the adjacency matrix.In this work, we use the class of dgl.nn.pytorch.conv.GraphConv in DGL (version 1.1.0) 1,2to develop our GCN model.

Node features
Graph-based protein node features usually adopt one-hot to encode the characteristic of each node, which can be physicochemical properties or evolutionary information.In this work, we applied the protein sequence embeddings generated by the ProtT5-XL-Uniref50 pretrained model as node features.ProtT5-XL-Uniref50 pretrained model was developed by Ahmed et al. 3 , which was trained on 450M protein sequences by using the T5 architecture with 3B parameters.It achieves state-of-the-art results in multiple downstream tasks compared with other popular protein language models 4 .The traditional way to obtain the node features is based on the sequence, and selected several amino acids from both the left and right of the mutation site as nodes 5 .Obviously, it is more reasonable to construct nodes through space because mutations change the surrounding interactions.One mutation is represented as the concatenating of wild sequence embedding and that of mutant sequence.This means, the node feature is a matrix of N*2048, where N is the set of residues less than 12 Å away from the mutation site and F 0 =2048 is the twice size of hidden layers.We denote the matrix as mutational embedding which represents the mutational effect on the whole sequence.

Spatial adjacency matrix (SAM)
Adjacency matrix indicates whether pairs of vertices are adjacent or not in the graph.This theory is based on the protein structure as a network, in which each amino acid is directly connected according to a certain relationship, the connection between each node and other nodes is the interaction between an amino acid and other surrounding amino acids.For geometric models, we defined the Cα atom of each residue as a node, and edges were drawn between nodes if they were within 10 Å from each other.Mutational embedding corresponding to each residue was then assigned to the respective node on the protein graph.

Additional knowledge-based features (AKB)
A set of additional knowledge regarding the environmental characteristics of the wild-type residue (e.g., relative solvent accessibility, residue depth and secondary structure) was added to fully connected layers.

Training details
The Graph Convolutional Network architectures of EnzyACT used the protein embeddings, spatial adjacency matrix and additional knowledge-based features as the input to train the model.

Symmetry test method
To evaluate whether the results of direct and reverse mutation predictions are symmetric, we defined a bias score in reference to stability prediction methods:

𝑁
Assume that the result of increased activity is 1, then the result of decreased activity is -1, a perfectly antisymmetric and unbiased method should have δ equal to 0.

Molecular Dynamics Simulation Protocol
The MD simulations of the constructed systems were performed by using the NAMD software package 13 with AMBER ff19SB force field 14 .Both the wild-type and mutant were embedded in a box-shaped (95 × the atom with the largest coordinate in that direction.The system was neutralized with sodium cations or chloride anions.Na+ and Cl− ion pairs were then added to reach a physiological salt concentration of 0.15 M. Approximately 82,300 atoms were used for each system.The solvated protein was equilibrated by carrying out a series of 4,000 steps of energy minimization with 10 kcal/mol/Å 2 restraints on the backbone, after 4,000 steps of minimization without restraints, 596 ps of heating restricted 2 kcal/mol/Å 2 on the backbone from 0 to 298 K, after heating, gradually decrease the restricted force within 500ps until 0, 1 ns of density equilibration with NVT followed by 500 ns of constant pressure equilibration at 298 K.The system was equilibrated using an NPT ensemble at 298 K and pressure at 1 atm (1 atm = 101.3kPa).All the simulations were run with SHAKE on hydrogen atoms, a 2 fs time step and a Langevin thermostat for temperature control and pressure control.Periodic boundary conditions and the Particle-Mesh-Ewald (PME) 15 algorithm were adopted to compute the long range electrostatic forces, and the cutoff was set as 10 Å. Trajectory frames were collected at every 8 ps for a total of 500 ns.All results are trained on S10998 and tested on S6103."Sequence" means using the left and right N amino acids near the mutation site on the sequence to replace the amino acids near the mutation site in space (SAM), and updating the corresponding adjacent matrix."No" means that this feature is not used for training model.

Figure S1 .
Figure S1.Ratio of different enzyme types in (A) training set and (B) test set (S2814).

Figure S3 .
Figure S3.EnzyACT performance for node features of different sizes.

Table S1 .
Training set and test set used for model development.

Table S2 .
Training set and test set used for model development.
Figure S2.Distribution of mutation number in training set and test set.Table S3.Features used in the multiple point mutation prediction Feature Dimension Description Individual Activity change 3 Max, minimum, and mean value of activity change upon each single-point mutation predicted by model AAIndex2 92 Scores from substitution tables Mutation distance 1 Mean value of the 3D distance between all mutation sites

Table S4 .
The performance of different secondary structures between the two methods on S2841 dataset

Table S5 .
The performance of different secondary structures between the two methods on P450 dataset

Table S6 .
Ablation study of EnzyACT

Table S7 .
Performance of different graph network models on the same data.

Table S12 .
Performance of the final model on datasets with different numbers of mutations

Table S13 .
Information about incomparable tools for enzyme activity prediction

Table S15 .
5-fold cross validation results for machine learning methods on balanced dataset.

Table S16 .
EnzyACT trained on balanced and unbalanced data set and tested on the S2814 dataset.